Let \(g\) denote a density surrogate in a class \(\mathcal{G}\).
In the Posterior Summary framework, we first obtain the optimal surrogate approximation of the Bayesian model through the posterior predictive distribution \(f(\widetilde{y}\mid y)\).
This is done by minimizing the loss function: \[ \widehat{\gamma} = \arg\min_{\gamma \in \Gamma} E_{\widetilde{Y}}[ \ell(\widetilde{y}; \gamma; g_{\gamma}) \mid Y] + \lambda P(\gamma) \]
The model \(\widehat{g}_{\gamma}\), indexed by \(\widehat{\gamma}\), is the density point estimate that approximates the model through \(f(\widetilde{y}\mid y)\).
This parallels DSS and related methods, where one would usually stop at this point.
The key additional feature of this step is that we can extract valuable information about the dimension of the surrogate.
Let \(K^* \in \mathbb{Z}_{+}\) denote the dimension of \(\widehat{\gamma}\), so that \(\widehat{\gamma}\) is defined on the class \(\Gamma^{K^*}\).
From this dimension we obtain the posterior around this estimate, which is the Posterior summary.
This is done again using a decision-theoretic approach, in which we obtain the posterior projection as \[ \gamma_{ps} = \arg\min_{\gamma \in \Gamma^{K^*}} \ell(\widetilde{y}; \gamma; g_{\gamma}) \]
In this case, \(\gamma_{ps}\) is the posterior summarization distribution of the parameter associated with the surrogate model.
From \(\gamma_{ps}\) we can then obtain estimates, such as the posterior summarization mean, as well as credible intervals.
It is important to notice that: \[ \arg\min_{\gamma \in \Gamma} E_{\widetilde{Y}}[ \ell(\widetilde{y}; \gamma; g_{\gamma}) \mid Y] \neq E\left[\arg\min_{\gamma \in \Gamma} \ell(\widetilde{y}; \gamma; g_{\gamma})\mid Y\right] \]
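A toy numerical check of this non-commutativity, under the simpler absolute-error loss \(\ell(\widetilde{y};\gamma) = |\widetilde{y}-\gamma|\) (an illustrative choice, not the loss used in this framework): the left-hand side is the median of the predictive draws, while the right-hand side is their mean, and the two differ for any skewed predictive distribution.

```r
# Toy illustration with absolute loss: argmin_gamma E|y - gamma| is the
# median, while argmin_gamma |y - gamma| for a single draw is the draw
# itself, so its expectation is the mean. A skewed sample separates them.
set.seed(1)
y.tilde <- rexp(1e5, rate = 1)   # skewed stand-in for predictive draws

min.of.mean <- median(y.tilde)   # argmin of the expected loss (~ log 2)
mean.of.min <- mean(y.tilde)     # expectation of the per-draw argmin (~ 1)

c(min.of.mean, mean.of.min)      # clearly different
```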
We apply the posterior summarization framework to Bayesian density estimation. The idea is to project a non-parametric or overparameterized model to a lower-dimensional interpretable model. Here we use the class of finite mixture models (FMM) as surrogates since they have optimal properties and are easy to interpret.
We let \(g_{\gamma}\) be a surrogate density defined as \[ g_{\gamma}(y) = g(y\mid \gamma) = \sum_{q = 1}^k\eta_q g_{q}(y\mid \gamma_q), \] where \(\sum_{q = 1}^k\eta_q = 1\), \(g_q(\cdot\mid\gamma_q)\) are densities and \(\gamma = (\eta_1,\dots,\eta_k,\gamma_1,\dots,\gamma_k)\).
We define the loss function as \(\ell(\widetilde{y}; \gamma; g_{\gamma}) = -\log g_{\gamma}(\widetilde{y})\).
This yields the optimal estimate \[ \widehat{\gamma} = \arg\min_{\gamma \in \Gamma} E_{\widetilde{Y}}[ -\log g_{\gamma}(\widetilde{y}) \mid Y]. \]
Let \(\widetilde{y}_1,\dots,\widetilde{y}_{\widetilde{n}} \sim f(\widetilde{y}\mid y)\) be a sample from the posterior predictive distribution. We can then approximate the optimization step above as
\[ \widehat{\gamma}^{k} = \arg\max_{\gamma \in \Gamma} \frac{1}{\widetilde{n}} \sum_{n=1}^{\widetilde{n}} \log \sum_{q=1}^{k} \eta_{q} g_{q} (\widetilde{y}_{n} | \gamma_{q}). \]
In this case, we replace the penalty on the loss function with a greedy search over the parametric classes \(\Gamma^k\), for \(k = 1,2,\dots,K_{\max}\), where \(K_{\max}\) is chosen to be large.
We solve this using the EM algorithm provided by the mclust package for \(k = 1,2,\dots,K_{\max}\).
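A minimal sketch of this step, assuming `y.tilde` holds draws from the posterior predictive distribution (replaced here by a synthetic stand-in): `Mclust()` runs the EM algorithm for each fixed number of components.

```r
library(mclust)

# Stand-in for draws from f(y~ | y); in practice these come from the model.
set.seed(1)
y.tilde <- c(rnorm(500, -2, 1), rnorm(500, 3, 0.5))

# One EM fit per candidate dimension k = 1, ..., Kmax
Kmax <- 5
fits <- lapply(1:Kmax, function(k)
  Mclust(y.tilde, G = k, verbose = FALSE))

# Each fit stores the surrogate parameters gamma-hat^k:
# mixing weights, component means and variances.
fits[[2]]$parameters$mean
```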
The idea is to search, among projections of size \(k = 1,2,\dots,K_{\max}\), for the one that best represents the model through \(f(\widetilde{y}\mid y)\).
The challenge now is to search for the surrogate with the smallest dimension that still has a good fit in relation to the posterior predictive distribution.
To achieve this goal, we propose a function that measures the discrepancy between the surrogate and the posterior predictive density \[\begin{equation} \delta_{n}^{k}(f,\widehat{g}^{k}_{\gamma}) = \log \frac{\widehat{g}^{k}_{\gamma}(\widetilde{y}_n)}{f_{\widetilde{y}}(\widetilde{y}_n)},\quad n = 1,2,\dots,\widetilde{n}. \end{equation}\]
A useful property of the discrepancy function is that its expectation approximates the negative Kullback–Leibler divergence: \[\begin{equation} E_{\widetilde{Y}}\left[\delta^k_{n}(f,\widehat{g}^{k}_{\gamma})\mid Y\right] = -\int f_{\widetilde{y}}(\widetilde{y})\log\frac{f_{\widetilde{y}}(\widetilde{y})}{\widehat{g}^{k}_{\gamma}( \widetilde{y})}d{\widetilde{y}} = - \text{KL}\left(f_{\widetilde{y}}(\widetilde{y})\|\widehat{g}^k_{\gamma}(\widetilde{y})\right). \end{equation}\]
From this property, we select the model with the smallest dimension \(k\) for which the discrepancy still satisfies
\[E_{\widetilde{Y}}\left[\delta^k_{n}(f,\widehat{g}^{k}_{\gamma})\mid Y\right] \approx 0.\]
Sample \(\widetilde{y}_1,\dots,\widetilde{y}_M \sim f(\widetilde{y}\mid y)\) from the posterior predictive distribution.
From the posterior predictive distribution obtain: \[ \widehat{\gamma}^{k} = \arg\max_{\gamma \in \Gamma} \frac{1}{M} \sum_{n=1}^{M} \log \sum_{q=1}^{k} \eta_{q} g_{q} (\widetilde{y}_{n} | \gamma_{q}) \]
for \(k = 1,2,\dots,K_{\max}\), resulting in a sequence of actions \(\widehat{\gamma}^{1},\widehat{\gamma}^{2},\dots,\widehat{\gamma}^{K_{\max}}\).
Generate discrepancy functions \(\delta^k_{1}(f,\widehat{g}^{k}_{\gamma}), \dots, \delta^k_{M}(f,\widehat{g}^{k}_{\gamma})\), for \(k = 1,2,\dots,K_{\max}\), where \[\delta_{n}^{k}(f,\widehat{g}^{k}_{\gamma}) = \log \frac{\widehat{g}^{k}_{\gamma}(\widetilde{y}_n)}{\widehat{f}_{\widetilde{y}}(\widetilde{y}_n)},\] in which \(\widehat{f}_{\widetilde{y}}(\widetilde{y}_n) = \frac{1}{M}\sum_{m = 1}^{M}f(\widetilde{y}_n \mid \theta^{(m)})\), and \(\widehat{g}^{k}_{\gamma}(\widetilde{y}_n) = \widehat{g}^{k}(\widetilde{y}_n\mid \widehat{\gamma}^k)\) is the surrogate.
Select the model with the best trade-off between its dimensionality and its approximation to the predictive distribution through: \[E_{\widetilde{Y}}\left[\delta^k_{n}(f,\widehat{g}^{k}_{\gamma})\mid Y\right] \approx \bar{\delta}^k(f,\widehat{g}^{k}_{\gamma}) = \frac{1}{M}\sum_{n = 1}^{M}\delta_{n}^k(f,\widehat{g}^{k}_{\gamma})\]
The selection is done visually through a plot.
This result provides insight into the dimension of the surrogate.
We refer to this “optimal” surrogate dimension as \(K^*\).
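The selection step can be sketched as follows; `y.tilde` and the Monte Carlo predictive density `f.hat()` are synthetic stand-ins for the model-specific quantities.

```r
library(mclust)

# Stand-ins: predictive draws and their (here known) predictive density.
set.seed(2)
y.tilde <- c(rnorm(600, -2, 1), rnorm(400, 3, 0.5))
f.hat   <- function(y) 0.6 * dnorm(y, -2, 1) + 0.4 * dnorm(y, 3, 0.5)

Kmax <- 6
delta.bar <- sapply(1:Kmax, function(k) {
  fit <- Mclust(y.tilde, G = k, verbose = FALSE)
  g.k <- dens(modelName = fit$modelName, data = y.tilde,
              parameters = fit$parameters)   # ghat^k at each draw
  mean(log(g.k / f.hat(y.tilde)))            # delta-bar^k ~ -KL(f || ghat^k)
})

plot(1:Kmax, delta.bar, type = "b", xlab = "k",
     ylab = expression(bar(delta)^k))
abline(h = 0, lty = 2)  # choose the smallest k with delta-bar^k near zero
```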
We can stop here if the objective is only a surrogate estimate.
We have a baseline for how far the dimension may be reduced without compromising predictive accuracy of the surrogate compared to the original model.
The lower-dimensional posterior summary is now obtained by projecting the posterior using this dimension as a reference.
\[ \gamma_{ps} = \arg\min_{\gamma \in \Gamma^{K^*}} \ell(\widetilde{y}; \gamma; g_{\gamma}) \]
For a fixed \(K^*\), we obtain \(\gamma_{ps}\) from \(\widetilde{y}\sim f(\widetilde{y}\mid y)\) as follows.
For each posterior draw \(\theta^{(m)}\sim \pi(\theta\mid y)\), we obtain a posterior predictive sample \(\widetilde{y}^{(m)}_1,\dots,\widetilde{y}^{(m)}_H\) of size \(H\).
We obtain a separate \(\gamma_{ps}^{(m)}\) for every \(m = 1,2,\dots,M\) as \[\gamma_{ps}^{(m)} = \arg \max_{\gamma \in \Gamma^{K^*}}\frac{1}{H} \sum_{h=1}^{H}\log \sum_{q=1}^{K^{*}} \eta_{q} g_{q} (\widetilde{y}_{h}^{(m)} | \gamma_{q})\]
From this process we obtain the sequence \(\gamma_{ps}^{(1)},\gamma_{ps}^{(2)}, \dots, \gamma_{ps}^{(M)}\) of posterior summaries with dimension \(K^*\), that approximate \(\gamma_{ps}\).
From this distribution we can get an estimate by averaging over \(\gamma_{ps}^{(m)}\).
We can also obtain a density estimate by averaging over the surrogate, resulting in \[ \bar{g}^{K^*}(y_i\mid \gamma_{ps}) = \frac{1}{M}\sum_{m = 1}^M g^{K^*}(y_i\mid \gamma_{ps}^{(m)}) \]
We obtain credible intervals for \(g^{K^*}(y_i\mid \gamma_{ps}^{(m)})\) from \(\gamma_{ps}^{(m)}\).
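The per-draw projection can be sketched as below; `theta.draws` and the predictive sampler `rpred()` are illustrative placeholders for the actual posterior draws and model.

```r
library(mclust)

set.seed(3)
M <- 50; H <- 300; K.star <- 2
theta.draws <- rnorm(M, mean = 3, sd = 0.1)   # placeholder posterior draws
rpred <- function(theta, H)                   # placeholder predictive sampler
  c(rnorm(H / 2, -2, 1), rnorm(H / 2, theta, 0.5))

# One projection gamma_ps^(m) per posterior draw, each of dimension K*
gamma.ps <- lapply(theta.draws, function(theta) {
  y.m <- rpred(theta, H)
  fit <- Mclust(y.m, G = K.star, verbose = FALSE)
  fit$parameters
})

# Posterior summarization mean of the (sorted) component means
means <- t(sapply(gamma.ps, function(p) sort(p$mean)))
colMeans(means)
```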
Assume \(K^*\) is the reduced surrogate dimension obtained in the first phase of our approach. Starting from this point, we may either get a projection of the cluster estimates, where \(K^*\) is the maximum number of clusters, or obtain a projection of the posterior on the cluster estimates, resulting in uncertainty estimates for the cluster allocation.
One key feature of our method is the ability to choose different surrogate projections based on the kind of clustering allocation procedure. This makes the procedure modular and hence more versatile than other methods. In this paper, we use two distinct cluster surrogate projections. The first is the maximum conditional probability given the optimal actions \(\widehat{\gamma}_1,\widehat{\gamma}_2,\dots,\widehat{\gamma}_{K^*}\). The second is based on the k-means distance, which in our case uses the posterior predictive sample to find optimal centroids from which to assign the data.
In the first surrogate, we first obtain the conditional probability surrogates \(\widehat{\eta}_{c}(y_i)\) as \[\widehat{\eta}_{c}(y_i) \propto \widehat{\eta}_{c} g (y_{i} | \widehat{\gamma}_{c}), \quad y_{i},\quad i = 1, 2, \dots, n, \quad c = 1,2,\dots,K^* \]
The cluster allocation estimate \(\widehat{c}_{i}\) is then obtained as \[ \widehat{c}_{i} = \arg\max_{c}(\widehat{\eta}_{c} (y_{i})),\quad i = 1,2,\dots,n, \quad c = 1,2,\dots,K^* \]
Under this projection choice, posterior summaries for the clustering allocation are obtained by conditioning on the actions \(\gamma_{ps}^{(m)}\), for \(m = 1,2,\dots,M\), resulting in \[ \eta_{c}^{(m)}(y_i) \propto \eta_{c}^{(m)} g (y_{i} | \gamma_{ps}^{(m)}), \quad i = 1, 2, \dots, n , \quad c = 1,2,\dots,K^* \] which in turn results in the posterior projection allocations: \[ \widehat{c}^{(m)}_{i} = \arg\max_{c} (\eta_{c}^{(m)} (y_{i})),\quad i = 1, 2, \dots, n , \quad c = 1,2,\dots,K^*, \] thus resulting in an approximation of the posterior projection on the clustering allocation for the observation \(y_i\). From this result, we can obtain uncertainty quantification for each allocation or get the posterior average estimate.
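A small sketch of this soft-allocation rule for a Gaussian surrogate; the weights, means, and standard deviations below are placeholders, not fitted values from the paper.

```r
# Placeholder surrogate parameters (K* = 2 Gaussian components)
eta   <- c(0.6, 0.4)
mu    <- c(-2, 3)
sigma <- c(1, 0.5)
y     <- c(-2.1, 0.4, 3.2)     # observations to allocate

# eta_c(y_i) proportional to eta_c * g(y_i | gamma_c), normalized per row
w        <- sapply(seq_along(eta), function(c) eta[c] * dnorm(y, mu[c], sigma[c]))
eta.post <- w / rowSums(w)     # n x K* allocation probabilities

c.hat <- max.col(eta.post)     # hard allocation: argmax over c
c.hat                          # 1 1 2
```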
For the second projection, the optimal centroids \(\widehat{\gamma}_c\) are obtained jointly with the assignments by solving the k-means problem over the posterior predictive sample, \[ (\widehat{\gamma}, \widehat{c}) = \arg \min_{\gamma,c}E_{\widetilde{Y}} \left[\sum_{n = 1}^{\widetilde{n}}\sum_{k = 1}^{K^*}\mathbb{1}\{c_n = k\} \|\widetilde{y}_n-\gamma_k\|^2\mid Y\right]. \]
Solving this minimization problem yields the optimal centroids \(\widehat{\gamma}_c\), from which we obtain the optimal clustering \[ \widehat{c}(y_i) = \arg\min_{c \in \{1,2,\dots,K^*\}} \| y_i - \widehat{\gamma}_c \|_{2}^{2},\quad \text{for}\quad i = 1,2,\dots,n. \]
We can again obtain the posterior summarization for the cluster allocation under this projection via \[ (\gamma^{(m)}, c^{(m)}) = \arg \min_{\gamma,c}E_{\widetilde{Y}} \left[\sum_{h = 1}^{H}\sum_{k = 1}^{K^*}\mathbb{1}\{c_h = k\} \|\widetilde{y}_{h}^{(m)}-\gamma_k\|^2\mid Y\right],\quad m = 1,2,\dots,M. \]
From this we again obtain the posterior projection on the cluster allocation as \[ {c}^{(m)}(y_i) = \arg\min_{c \in \{1,2,\dots,K^*\}} \| y_i - \gamma_{c}^{(m)} \|_{2}^{2},\quad \text{for}\quad i = 1,2,\dots,n,\quad m = 1,2,\dots,M, \] which results in uncertainty quantification for the cluster allocation.
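The k-means projection can be sketched as follows, with `y.tilde` a stand-in for the posterior predictive draws and `y` for the observed data.

```r
set.seed(4)
K.star  <- 2
y.tilde <- matrix(c(rnorm(300, -2, 0.5), rnorm(300, 3, 0.5)), ncol = 1)
y       <- matrix(c(-1.8, 0.2, 2.9), ncol = 1)

# Optimal centroids gamma-hat_c from the predictive sample
km <- kmeans(y.tilde, centers = K.star, nstart = 10)

# Allocate each observation to its nearest centroid (squared L2 distance)
d2    <- outer(as.vector(y), as.vector(km$centers), function(a, b) (a - b)^2)
c.hat <- apply(d2, 1, which.min)
c.hat
```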
# Loading files
setwd("/Users/hb23255/Dropbox/DSS_MIX/DSS_MIX_06_2024/")
source("source/dcpossum_dens_comp.R")
source("source/dcpossum_sim_data_mix.R")
source("source/dcpossum_plots.R")
source("source/dcpossum_unc.R")
source("source/func_pred_laplace_temp.R")
source("source/dcpossum_clust.R")
source("source/dcpossum_clust_compar.R")
\[ 0.2 \times N(19,5) + 0.2 \times N(19,1) + 0.25\times N(23,1) + 0.2\times N(29,1) + 0.15\times N(33,2) \]
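A base-R sketch of simulating from this mixture (purely illustrative; it treats the second argument of each \(N(\cdot,\cdot)\) as a standard deviation, which is an assumption about the notation):

```r
# Simulate n draws from the five-component Gaussian mixture above,
# assuming N(mean, sd) notation.
set.seed(1820)
n   <- 1000
wts <- c(0.20, 0.20, 0.25, 0.20, 0.15)
mus <- c(19, 19, 23, 29, 33)
sds <- c(5, 1, 1, 1, 2)

comp  <- sample(seq_along(wts), n, replace = TRUE, prob = wts)
y.sim <- rnorm(n, mean = mus[comp], sd = sds[comp])
hist(y.sim, breaks = 40, main = "", xlab = "y")
```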
set.seed(1820)
# y.data.1 = data_sim_func("example_02", n = 1000)
setwd("/Users/hb23255/Dropbox/DSS_MIX/DSS_MIX_06_2024/")
y.data.1 = read.csv("source/BP_source/sim_bp/bp_1000/data_bp_01_03.csv")
y.data.1 = y.data.1$x
MFM.1 = dcpossum.MFM(y.data.1,kmax = 10, quant.sample = 1000)
DPM.1 = dcpossum.DPM(y.data.1,kmax = 10, quant.sample = 1000)
BP.1 = dcpossum.BP(y.data.1, kmax = 10, BP.run = F, quant.sample = 1000, data.bp = "sim_1000", pred.f = FALSE)
# Plot - identify optimal components
comp.1 = plot.possum.comp.uni(BP.1,DPM.1,MFM.1, title = "", kmax = 10, y.data = y.data.1, sel.K = 5)
comp.1
# Plot - number of components from the methods
comp.2 = plot.postcomp.uni(DPM.1,MFM.1,BP.1,K.true = 5)
comp.2
possum.BP.1 = possum.unc.quant.values(BP.1, K.sel = 5)
possum.MFM.1 = possum.unc.quant.values(MFM.1, K.sel = 5)
possum.DPM.1 = possum.unc.quant.values(DPM.1, K.sel = 5)
plots.possum = plot.possum.unc.quant.dc(possum.BP.1,possum.MFM.1,possum.DPM.1, K.sel = 5,
y.data.1, index.possum = "example_02")
possum.p = plots.possum$dens.summ
possum.p
\[ H(f, g) = \frac{1}{\sqrt{2}} \left( \int \left( \sqrt{f(x)} - \sqrt{g(x)} \right)^2 \, dx \right)^{\frac{1}{2}} \]
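This distance can be approximated by simple grid integration; the sketch below compares a standard Laplace and a standard normal density as a toy example.

```r
# Numerical Hellinger distance via grid integration (toy example)
dlaplace <- function(x, m = 0, b = 1) exp(-abs(x - m) / b) / (2 * b)

hellinger <- function(f, g, lower = -20, upper = 20, n.grid = 1e4) {
  x  <- seq(lower, upper, length.out = n.grid)
  dx <- x[2] - x[1]
  sqrt(sum((sqrt(f(x)) - sqrt(g(x)))^2) * dx) / sqrt(2)
}

h.lap.norm <- hellinger(function(x) dlaplace(x), function(x) dnorm(x))
h.lap.norm   # strictly between 0 and 1; 0 only for identical densities
```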
\[ 0.4\times Laplace(-5,1.5) + 0.6\times Laplace(5,1) \]
set.seed(1820)
y.data.lap = data_sim_func("examp_laplace", n = 600)
plot(mclustBIC(y.data.lap))
# Gaussian surrogate - mclust
temp.MFM = dcpossum.MFM(y.data.lap,kmax = 10, quant.sample = 1000)
temp.DPM = dcpossum.DPM(y.data.lap,kmax = 10, quant.sample = 1000)
temp.BP = dcpossum.BP(y.data.lap, kmax = 10, BP.run = F, quant.sample = 1000, data.bp = "examp_laplace")
# Comparison of different methods
comp.1 = possum.comparison.laplace(temp.BP,temp.MFM,temp.DPM)
comp.1
# Plot - number of components from the methods
comp.2 = plot.postcomp.uni(temp.DPM,temp.MFM,temp.BP,K.true = 2)
comp.2
# Generate possum for gaussian surrogate
possum.BP.1 = possum.unc.quant.values(temp.BP, K.sel = 2)
possum.MFM.1 = possum.unc.quant.values(temp.MFM, K.sel = 2)
possum.DPM.1 = possum.unc.quant.values(temp.DPM, K.sel = 2)
# Gaussian surrogate possum
plots.possum = plot.possum.unc.quant.dc(possum.BP.1,possum.MFM.1,possum.DPM.1,
K.sel = 2, y.data.lap, index.possum = "examp_laplace")
comp.4 = possum.laplace.pred.all(temp.BP,temp.MFM,temp.DPM, y.data.lap, scale.plot = c(-13,11),
K.sel = 2, kmax = 10, quant = c(0.025,0.975))
possum.dens = plots.possum$dens.summ
possum.dens
comp.4
comp.3 = possum.laplace.coef(temp.BP,temp.MFM,temp.DPM)
possum.par = plots.possum$param.sum
possum.par + comp.3
set.seed(1820)
y.data.app = data_sim_func("galaxy")
MFM.app = dcpossum.MFM(y.data.app,kmax = 10, quant.sample = 1000)
DPM.app = dcpossum.DPM(y.data.app,kmax = 10, quant.sample = 1000)
BP.app = dcpossum.BP(y.data.app, kmax = 10, BP.run = F, quant.sample = 1000, data.bp = "galaxy")
# Plot - identify optimal components
comp.1 = plot.possum.comp.uni(BP.app,DPM.app,MFM.app, title = "", kmax = 10, y.data = y.data.app)
comp.1
comp.2 = plot.postcomp.uni(DPM.app,MFM.app,BP.app,K.true = 0)
comp.2
possum.BP.1 = possum.unc.quant.values(BP.app, K.sel = 4)
possum.MFM.1 = possum.unc.quant.values(MFM.app, K.sel = 4)
possum.DPM.1 = possum.unc.quant.values(DPM.app, K.sel = 4)
# for gaussian surrogate
plots.possum = plot.possum.unc.quant.dc(possum.BP.1,possum.MFM.1,possum.DPM.1, K.sel = 4,
y.data = y.data.app, scale.plot = c(5.5,37), index.possum = FALSE)
possum.p = plots.possum$dens.summ
possum.p
Consider the data distribution \(\sum_{i=1}^3 \eta_i N(\mu_i, C_i)\) where \[ \eta = (0.45, 0.3, 0.25), \quad \mu_1 = \begin{pmatrix} 4 \\ 4 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 7 \\ 4 \end{pmatrix}, \quad \mu_3 = \begin{pmatrix} 6 \\ 2 \end{pmatrix}, \] \[ C_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad C_2 = R \begin{pmatrix} 2.5 & 0 \\ 0 & 0.2 \end{pmatrix} R^T \quad \text{where} \quad R = \begin{pmatrix} \cos \rho & -\sin \rho \\ \sin \rho & \cos \rho \end{pmatrix} \quad \text{with} \quad \rho = \pi / 4, \] \[ \text{and} \quad C_3 = \begin{pmatrix} 3 & 0 \\ 0 & 0.1 \end{pmatrix}. \]
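A sketch of simulating from this three-component bivariate mixture with `MASS::mvrnorm`, as a stand-in for the `data_sim_func` call; the simulated data are illustrative only.

```r
library(MASS)

set.seed(1822)
n   <- 1000
eta <- c(0.45, 0.30, 0.25)
mu  <- list(c(4, 4), c(7, 4), c(6, 2))

# Rotation matrix R with angle rho = pi/4 (filled column-wise)
rho <- pi / 4
R   <- matrix(c(cos(rho), sin(rho), -sin(rho), cos(rho)), 2, 2)
C   <- list(diag(2),
            R %*% diag(c(2.5, 0.2)) %*% t(R),
            diag(c(3, 0.1)))

# Draw component labels, then sample from the matching bivariate normal
comp  <- sample(1:3, n, replace = TRUE, prob = eta)
y.sim <- t(sapply(comp, function(k) mvrnorm(1, mu[[k]], C[[k]])))
colnames(y.sim) <- c("V1", "V2")
head(y.sim)
```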
set.seed(1822)
y.data.exam.03 = data_sim_func("example_mult_03_clust", n = 1000)
# plot(y.data.2[,1],y.data.2[,2])
y.data.2 = y.data.exam.03[,c(1:2)]
y.clust = y.data.exam.03[,3]
temp.MFM.mult = dcpossum.MFM.mult(y.data.2, kmax = 10, quant.sample = 1000)
temp.DPM.mult = dcpossum.DPM.mult(y.data.2, kmax = 10, quant.sample = 1000)
temp.SFM.mult = dcpossum.SFM.mult(y.data.2, kmax = 10, p.dim = 2, col.data = 1:2,
clust.info = 0, quant.sample = 1000,
post.size = 1000)
# Plot discrepancy function given number of factors
comp.estim = plot.estim.comp.mult(temp.MFM.mult,temp.DPM.mult,temp.SFM.mult, sel.K = 3, title = "Discrepancy function")
comp.estim
# Plot number of factors
comp.mult = plot.postcomp.mult(temp.DPM.mult,temp.MFM.mult,temp.SFM.mult, K.true = 3)
comp.mult
MFM.dens = temp.MFM.mult
DPM.dens = temp.DPM.mult
SFM.dens = temp.SFM.mult
plot.pred = plot_pred_dens_mult(SFM.dens,DPM.dens,MFM.dens, y.data.2, index = "example_02", sel.K = 3, plot_density = "simulation")
plot.pred
possum.SFM = possum.unc.quant.values.mult(temp.SFM.mult,K.sel = 3)
possum.DPM = possum.unc.quant.values.mult(temp.DPM.mult,K.sel = 3)
possum.MFM = possum.unc.quant.values.mult(temp.MFM.mult,K.sel = 3)
plot.possum.marg = plot.possum.uncertain.quant.dc.mult(possum.SFM, possum.MFM, possum.DPM,
K.sel = 3, y.data.2, scale.plot = FALSE,
index.possum = FALSE, quant = c(0.025,0.975))
plot.possum.marg
possum_clust = dc.possum.clust(temp.SFM.mult, y.data.2, K.sel = 3, km = TRUE)
plot.unc.clust = ggplot(possum_clust, aes(x = dat.V1, y = dat.V2, color = clust.prob)) +
  geom_point(size = 0.6) +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position = "bottom") +
  scale_color_distiller(palette = "Spectral", direction = -1, limits = c(0, 0.75)) +
  ggtitle("Uncertainty - soft")
plot.unc.km = ggplot(possum_clust, aes(x = dat.V1, y = dat.V2, color = clust.prob.km)) +
  geom_point(size = 0.6) +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position = "bottom") +
  scale_color_distiller(palette = "Spectral", direction = -1, limits = c(0, 0.75)) +
  ggtitle("Uncertainty - K-means")
optim.clust = dc.optim.clust(temp.SFM.mult, y.data.2, K.sel = 3)
optim.clust.kmeans = dc.optim.clust.kmeans(temp.SFM.mult, y.data.2, K.sel = 3)
plot.optim = ggplot(optim.clust, aes(x = dat.V1, y = dat.V2, shape = as.factor(optim.clust), color = as.factor(optim.clust))) +
  geom_point(size = 0.6) +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position = "bottom") +
  ggtitle("Optimal - soft")
plot.optim.kmeans = ggplot(optim.clust.kmeans, aes(x = dat.V1, y = dat.V2, shape = as.factor(clust.km), color = as.factor(clust.km))) +
  geom_point(size = 0.6) +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position = "bottom") +
  ggtitle("Optimal - K-means")
y.data.examp = as.data.frame(y.data.exam.03)
gb = GBclust.func(y.data.2, H = 3)
true_clust = ggplot(y.data.examp, aes(x = V1, y = V2, shape = as.factor(clust_assign), color = as.factor(clust_assign))) +
  geom_point(size = 0.6) +
  theme_light() +
  xlab("") +
  ylab("") +
  theme(legend.position = "bottom") +
  ggtitle("True")
(true_clust + plot.optim + plot.optim.kmeans) / (plot.unc.clust + plot.unc.km + gb)
set.seed(1234)
y.data.thyroid = data_sim_func("thyroid")
y.data.clust = data_sim_func("thyroid_clust")
temp.MFM.mult = dcpossum.MFM.mult(y.data.thyroid, kmax = 10, quant.sample = 1000)
temp.DPM.mult = dcpossum.DPM.mult(y.data.thyroid, kmax = 10, quant.sample = 1000)
temp.SFM.mult = dcpossum.SFM.mult(y.data.thyroid, kmax = 10, p.dim = 5, col.data = 1:5,
clust.info = 0, quant.sample = 1000, post.size = 1000)
comp.estim = plot.estim.comp.mult(temp.MFM.mult,temp.DPM.mult,temp.SFM.mult, title = "Discrepancy function")
comp.estim
comp.model = plot.postcomp.mult(temp.MFM.mult,temp.DPM.mult,temp.SFM.mult, K.true = 3)
comp.model
possum_clust = dc.possum.clust(temp.SFM.mult, y.data = y.data.thyroid, K.sel = 3, km = FALSE)
optim.clust = dc.optim.clust(temp.SFM.mult, y.data.thyroid, K.sel = 3)
category_mapping <- c("Normal" = 1, "Hypo" = 3, "Hyper" = 2)
# Transforming the categories using mutate and recode
transformed_data <- thyroid %>%
mutate(Category_Numeric = recode(Diagnosis, !!!category_mapping))
clust.sfm = unlist(temp.SFM.mult[[8]])
kmax.sfm = max(clust.sfm)
legend_title <- "Groups"
colors.temp <- c("#619CFF", "#F8766D", "#00BA38", "#F564E3", "#00BFC4","#B79F00", "#F564E3", "#FF64B0", "#00BF7D", "#A3A500")
col.sfm = colors.temp[1:kmax.sfm]
shapes.temp <- c(15, 16, 17, 3, 7, 8, 18, 4, 19, 20)
shape.sfm = shapes.temp[1:kmax.sfm]
p1 = ggplot(transformed_data) +
geom_point(aes(x = RT3U, y = T4, shape = as.factor(Category_Numeric), col = as.factor(Category_Numeric)), size = 1.2) +
theme_light() +
theme(legend.position="bottom") +
labs(title="True", x ="RT3U", y = "T4") + scale_shape_manual(legend_title, values = c(15, 16, 17)) +
scale_colour_manual(legend_title, values=c("#619CFF", "#F8766D", "#00BA38"))
p2 = ggplot(optim.clust) +
geom_point(aes(x = dat.RT3U,y = dat.T4, shape = as.factor(optim.clust), col = as.factor(optim.clust)), size = 1.2) +
theme_light() +
theme(legend.position="bottom") +
labs(title="Optimal clustering classification", x = "RT3U", y = "T4") + scale_shape_manual(legend_title, values = c(15, 16, 17)) +
scale_colour_manual(legend_title, values=c("#619CFF", "#F8766D", "#00BA38"))
p3 = ggplot(possum_clust) +
geom_point(aes(x = dat.RT3U,y = dat.T4, col = clust.prob), size = 1.2) +
scale_color_distiller(palette = "Spectral", direction = -1, limits = c(0, max(possum_clust$clust.prob))) +
theme_light() +
theme(legend.position="bottom") +
labs(title = "Possum uncertainty estimate", x = "RT3U", y = "T4")
p4 = ggplot(optim.clust) +
geom_point(aes(x = dat.RT3U,y = dat.T4, shape = as.factor(clust.sfm), col = as.factor(clust.sfm)), size = 1.2) +
theme_light() +
theme(legend.position="bottom") +
labs(title="SFM classification", x = "RT3U", y = "T4") + scale_shape_manual(legend_title, values = shape.sfm) +
scale_colour_manual(legend_title, values = col.sfm)
(p1 + p2) / (p3 + p4)
true_groups = transformed_data[,7]
table(true_groups, optim.clust$optim.clust)
##
## true_groups 1 2 3
## 1 2 146 2
## 2 35 0 0
## 3 0 4 26
adjustedRandIndex(true_groups, optim.clust$optim.clust)
## [1] 0.8779971
classError(optim.clust$optim.clust, true_groups)$errorRate
## [1] 0.0372093
table(true_groups, clust.sfm)
## clust.sfm
## true_groups 1 2 3 4
## 1 1 147 0 2
## 2 34 1 0 0
## 3 0 4 15 11
adjustedRandIndex(true_groups, clust.sfm)
## [1] 0.8654924
classError(clust.sfm, true_groups)$errorRate
## [1] 0.08837209